Linear Regression
Used for predicting continuous values:
- Simple: \(y = \alpha + \beta x\)
- Multiple: \(y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n\)
- Polynomial: \(y = \alpha + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n\)
- and many others…
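As an illustration of the simple case, here is a minimal sketch that estimates \(\alpha\) and \(\beta\) from data. It assumes an ordinary least squares fit, which the list above does not specify; the function name and the toy data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (alpha, beta) minimizing the squared error of y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    beta = cov_xy / var_x
    # Intercept: the fitted line passes through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Toy data generated from y = 1 + 2x (hypothetical values).
alpha, beta = fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
```

On this noiseless toy data the fit recovers the generating line; with real data the result is only the best linear approximation in the least squares sense.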
Logistic Regression
Used for classification problems:
- Binomial: only two possible categories
- Multinomial: three or more possible categories
- Ordinal: three or more possible categories which are ordered
Decision Trees
A tree-like structure used for both classification and regression
Random Forests
An ensemble method that combines multiple decision trees:
- Train \(B\) trees independently using:
  - Bagging: each tree is fitted on a random subset of the training set
  - Feature bagging: each split in the decision tree (i.e. each node) is chosen among a random subset of the features
- Make a decision by aggregating the individual decisions of each tree
Boosting
An ensemble method that combines weak learners (usually decision trees) to form a stronger model:
- Choose a simple base learner (e.g. small decision trees with a fixed number of leaves)
- Repeatedly:
  - Train a new base learner on the weighted training set
  - Add this new learner to the ensemble
  - Give more weight in the training set to misclassified data
\(k\)-Nearest Neighbors (KNN)
A non-parametric method for classification and regression
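A minimal sketch of KNN classification, assuming Euclidean distance and a majority vote among the \(k\) nearest training points (both are common choices that the text above does not specify). The function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Sort training points by their distance to the query point.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most frequent label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
```

There is no training step: the "model" is the training set itself, which is why the method is called non-parametric.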
Naive Bayes
A probabilistic classifier based on Bayes’ theorem
Support Vector Machines (SVM)
Used for classification and regression, effective in high-dimensional spaces:
- Separate the feature space using optimal hyperplanes
- Features can be mapped to a higher-dimensional space, so that a linear separation in that space corresponds to a non-linear boundary in the original feature space
\(k\)-Means Clustering
A method for partitioning data into \(k\) clusters:
- \(k\) must be chosen beforehand
- Create clusters around iteratively improved center points
- Classical method with many variants
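The iterative improvement of the center points can be sketched as follows. This assumes the classical variant (often called Lloyd's algorithm), Euclidean distance, a fixed number of iterations, and initial centers supplied by the caller; all names are hypothetical.

```python
import math


def k_means(points, centers, n_iterations=10):
    """Alternate between assigning points to centers and moving the centers."""
    centers = list(centers)  # k is the number of initial centers
    for _ in range(n_iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old center if the cluster is empty
                centers[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centers, clusters


centers, clusters = k_means(
    points=[(0, 0), (0, 2), (10, 10), (10, 12)],
    centers=[(0, 0), (10, 10)],
)
```

The result depends on the initial centers, which is one reason why so many variants of the method exist.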
Hierarchical Clustering
Builds a hierarchy of clusters using either agglomerative or divisive methods:
- Build a full hierarchy top-down (divisive) or bottom-up (agglomerative)
- Create any number of clusters by cutting the tree
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Clustering based on the density of data points:
- Divides points into 4 categories: core points (in red below), directly reachable (yellow), reachable and outliers (blue)
- Only two parameters: radius size (\(\epsilon\)) and number of neighbors to be core (\(min_{pts}\))
Principal Component Analysis (PCA)
A linear dimensionality reduction technique:
- Project data into a space of lower dimension
- Keep as much variance (so as much information) as possible
t-Distributed Stochastic Neighbor Embedding (t-SNE)
A nonlinear dimensionality reduction technique primarily used for visualization of high-dimensional data
Q-Learning
A value-based reinforcement learning algorithm:
- Learn the expected return (cumulative future reward) of each action in each situation
- Limited to discrete and simple environments
- Neural network variants (like Deep Q-Learning) can handle more complex environments
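A minimal sketch of one tabular Q-Learning update, assuming the standard update rule with a learning rate and a discount factor. The table is a plain dictionary, and the state and action names are hypothetical.

```python
def q_learning_update(q, state, action, reward, next_state, actions,
                      learning_rate=0.5, discount=0.9):
    """Apply one Q-Learning step; `q` maps (state, action) pairs to values."""
    # Target: immediate reward plus the discounted value of the best next action.
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    target = reward + discount * best_next
    old = q.get((state, action), 0.0)
    # Move the current estimate a fraction of the way towards the target.
    q[(state, action)] = old + learning_rate * (target - old)
    return q[(state, action)]


q_table = {("s1", "left"): 1.0, ("s1", "right"): 2.0}
new_value = q_learning_update(q_table, "s0", "right", 1.0, "s1", ["left", "right"])
```

Because the table needs one entry per (state, action) pair, this form only scales to small discrete environments.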
State-Action-Reward-State-Action (SARSA)
The on-policy counterpart of Q-Learning (which is off-policy). On-policy and off-policy methods differ in the policy used to make decisions during training:
- On-policy: the exact same policy that is currently being learnt
- Off-policy: another policy
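A minimal sketch of one tabular SARSA update, assuming the standard update rule. The only difference from a Q-Learning update is the target: it uses the value of the action actually chosen next by the current policy instead of the best available action. Names and values are hypothetical.

```python
def sarsa_update(q, state, action, reward, next_state, next_action,
                 learning_rate=0.5, discount=0.9):
    """Apply one SARSA step; `q` maps (state, action) pairs to values."""
    # On-policy target: value of the action the current policy really takes next.
    target = reward + discount * q.get((next_state, next_action), 0.0)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + learning_rate * (target - old)
    return q[(state, action)]


q_table = {("s1", "left"): 1.0, ("s1", "right"): 2.0}
# The policy chose "left" in s1, even though "right" has a higher value.
new_value = sarsa_update(q_table, "s0", "right", 1.0, "s1", "left")
```

The five arguments state, action, reward, next state and next action are what give the algorithm its name.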
Policy Gradient Methods
Optimize the policy by adjusting parameters in the direction that improves performance (e.g., REINFORCE algorithm)
Actor-Critic Methods
Combination of an Actor and a Critic learning simultaneously:
- The Actor learns the policy through a parametrized function
- The Critic estimates the value of each action and gives feedback to the Actor
Introduction to Neural Networks (NN)
Definition
Neural Network (NN)
Subtype of ML model inspired by the brain. Composed of several interconnected layers of nodes that process and pass on information.
Basic Elements
Neuron
Takes multiple inputs, computes their weighted sum and passes the result as output.
Layer
Set of similar neurons taking different inputs and/or having different weights.
Neural Network (NN)
Sequence of layers.
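The three basic elements can be sketched in a few lines. This is a minimal illustration, not a practical implementation: the bias term is an assumption not mentioned in the definitions above, activation functions are left out, and all names are hypothetical.

```python
def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs (plus an optional bias, an assumption here)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def layer(inputs, weight_rows, biases):
    """A set of neurons sharing the same inputs, each with its own weights."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def network(inputs, layers):
    """A sequence of layers: the output of one layer feeds the next one."""
    for weight_rows, biases in layers:
        inputs = layer(inputs, weight_rows, biases)
    return inputs


# Two inputs -> one hidden neuron -> one output neuron.
output = network([1, 2], [([[1, 1]], [0]), ([[2]], [1])])
```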
Linear Functions
Linear Function
Function that can be written like this: \[
f(\alpha_1, \cdots, \alpha_n) = (\beta_{1,1} \alpha_1 + \cdots + \beta_{1,n} \alpha_n, \cdots, \beta_{m,1} \alpha_1 + \cdots + \beta_{m,n} \alpha_n)
\]
Composition of Linear Functions
The composition of any number of linear functions is a linear function.
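A small numerical check of this property, writing each linear function as a matrix (a list of rows). Composing two linear functions then amounts to multiplying their matrices. The helper names and the example matrices are hypothetical.

```python
def apply(matrix, vector):
    """Apply the linear function represented by `matrix` to `vector`."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def compose(outer, inner):
    """Matrix of x -> outer(inner(x)), i.e. the matrix product outer * inner."""
    inner_columns = list(zip(*inner))
    return [[sum(o * i for o, i in zip(row, col)) for col in inner_columns]
            for row in outer]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
x = [5, 6]

two_steps = apply(B, apply(A, x))   # apply A, then B
one_step = apply(compose(B, A), x)  # apply the single composed function
```

Both computations give the same result, so stacking purely linear layers adds no expressive power.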
Activation Functions - Definition
Activation Function
Function applied to the output of a NN layer (i.e. to the output of each of its neurons) to introduce non-linearity to the model.
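Without activation functions, a sequence of layers would collapse into a single linear function. Activation functions make it possible to approximate much more complex functions, using an alternating sequence of affine layers and activation layers.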
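As an illustration, here is a minimal sketch of two widely used activation functions, ReLU and sigmoid, applied to the output of a layer. The choice of these two functions and the helper name `activate` are assumptions made for the example.

```python
import math


def relu(x):
    """Rectified Linear Unit: keeps positive values, replaces negatives with 0."""
    return max(0.0, x)


def sigmoid(x):
    """Squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def activate(layer_output, activation):
    """Apply the activation function to the output of each neuron of a layer."""
    return [activation(value) for value in layer_output]


activated = activate([-1.0, 2.0], relu)
```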
Activation Functions - Examples
Fully Connected
The most basic layer, in which each output is a linear combination of each input (before the activation layer)
Convolutional
A layer combining spatially close features, widely used to process raster data such as images.
Recurrent
Type of layer designed to process sequential data such as text, time series data, speech or audio. Works by combining the input data with the state from the previous time step.
The two main variants of recurrent layers are Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU).
Nowadays, however, transformer architectures are preferred for processing sequential data.
Pooling
A type of layer used to reduce the number of features by merging multiple features into one. There are several kinds of pooling layers, the simplest being Maximum Pooling and Average Pooling.
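A minimal sketch of both pooling variants on one-dimensional data, assuming non-overlapping windows (stride equal to the window size). Real pooling layers usually work on 2D feature maps; the helper names are hypothetical.

```python
def pool_1d(values, window, reduce):
    """Merge each non-overlapping window of `values` into one feature."""
    return [reduce(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]


def average(chunk):
    return sum(chunk) / len(chunk)


features = [1, 3, 2, 8, 5, 4]
max_pooled = pool_1d(features, 2, max)      # Maximum Pooling
avg_pooled = pool_1d(features, 2, average)  # Average Pooling
```

With a window of 2, six features are reduced to three in both cases.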
Residual
A Residual Block aims to stabilize the training and convergence of deep neural networks (with a large number of layers) by adding the input of a given layer to the output of another layer further down in the architecture.
Attention
Attention aims to determine the relative importance of each part of the input in order to make better predictions. It is widely used in natural language processing (NLP) and image processing.
And a lot more…
- Dropout: randomly drop out some of the nodes during training to reduce overfitting
- Batch Normalization: normalize the input of each layer across the batch to improve training stability and speed
- Layer Normalization: normalize the input of each layer across the features to improve training stability and speed
- Embedding: transforms discrete input data into continuous vectors in a lower-dimensional space
- Flatten: convert multi-dimensional data into 1D data that can be fed into fully connected layers
- …
Loss Function - Definition
A loss function is a mathematical function that quantifies the difference between the network’s predicted output and the actual target values. The goal during training is to minimize this loss by adjusting the model’s weights, using gradient descent.
The most common loss functions are:
- Mean Squared Error (MSE) for regression
- Cross-entropy Loss for classification
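A minimal sketch of these two losses. The cross-entropy version assumes a single sample whose target is a class index and whose prediction is a list of class probabilities; the function names are hypothetical.

```python
import math


def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def cross_entropy(true_class, predicted_probabilities):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(predicted_probabilities[true_class])


mse = mean_squared_error([1, 2, 3], [1, 2, 5])
confident_and_right = cross_entropy(0, [0.9, 0.1])
confident_and_wrong = cross_entropy(1, [0.9, 0.1])
```

Both losses are 0 for a perfect prediction and grow as the prediction moves away from the target.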
Loss Function - Differentiable
To be able to perform gradient descent, the loss function must be differentiable, which means continuous (no jump) and smooth (no sudden change of direction).
Loss Function - Convex
To get the best results when performing gradient descent, it is also better if the function is convex. The simplest definition of convexity is that if you draw a straight segment between any two points on the curve, the curve lies on or below that segment.
Example of convex and non-convex functions[1]
Gradient Descent - Definition
Gradient Descent
The process of iteratively computing the gradient of a function and stepping in the opposite direction to (hopefully) reach the minimum value of the function (if it exists).
Gradient Descent works because at any point in the domain of the function, the gradient points in the direction of steepest ascent, so its opposite points in the direction of steepest descent. Locally, following the negative gradient is therefore the quickest way to reach lower values of the function. If we come back to the requirements listed before:
- Differentiable functions are functions where the gradient exists everywhere
- Convex functions are convenient for gradient descent because any local minimum is also a global minimum, so steadily going down the function always leads to the minimum value.
Gradient Descent - Algorithm
Gradient Descent boils down to repeating the following steps:
- Compute the gradient of the loss function at the current point
- Take a step in the direction opposite to the gradient to reach a new point
- Go back to the first step, until the stop condition is met
In this process, the three things that have to be defined are:
- The starting point (weights initialization)
- The size of the steps (learning rate)
- The condition to stop
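A minimal sketch of the algorithm on a one-dimensional function, with the three choices made explicit: the starting point, the learning rate, and a fixed number of steps as the stop condition. The function \(f(x) = (x - 3)^2\) and all names are hypothetical.

```python
def gradient_descent(gradient, start, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    point = start
    for _ in range(n_steps):  # stop condition: fixed number of steps
        # Step in the direction opposite to the gradient.
        point = point - learning_rate * gradient(point)
    return point


# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
```

In a neural network the "point" is the full set of weights and the function is the loss, but the loop has the same shape.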
Gradient Descent - Weights initialization
The starting point is defined by the first output of the model, and therefore by the initial values of the weights of the model. There are numerous methods to initialize the weights, but the most common one is to draw them randomly from a centered and normalized Gaussian distribution (mean 0, standard deviation 1).
Gradient Descent - Learning rate
The gradient gives us a direction and a norm, but the scale of this norm is arbitrary and has to be adjusted using what we call the learning rate. The learning rate is not the size of the steps itself, but the scalar factor applied to the gradient, which means that the gradient's norm still plays a crucial role.
The choice of the learning rate is crucial to converge quickly and, hopefully, to the global minimum of the loss.
Example of gradient descent on the same function with different learning rates[1]
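The effect of the learning rate can be reproduced on a toy function. This sketch runs 20 steps of gradient descent on \(f(x) = x^2\) (gradient \(2x\), minimum at 0) with three hypothetical learning rates.

```python
def run(learning_rate, start=1.0, n_steps=20):
    """Run gradient descent on f(x) = x^2 and return the final point."""
    x = start
    for _ in range(n_steps):
        x -= learning_rate * 2 * x  # the gradient of x^2 is 2x
    return x


too_small = run(0.01)  # moves towards 0, but very slowly
adequate = run(0.4)    # gets very close to 0
too_large = run(1.1)   # overshoots more at each step and diverges
```

Each step multiplies \(x\) by \(1 - 2 \times\) learning rate, so for this function the iterates only shrink when the learning rate is below 1.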
Gradient Descent - Stop condition
The stop condition determines when you decide to stop the algorithm. An easy solution is to choose a number of steps before launching the algorithm, but this implies either useless computations after the algorithm has reached a final point, or stopping too early and not getting the best possible results.
Therefore, although there are more complex methods, the most common and simple process is to monitor the value of the loss, memorize the lowest value ever reached, and stop when there has been a given number of steps without any improvement to the best value. Then, we usually keep the model weights corresponding to this best value.
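This monitoring process, often called early stopping, can be sketched as follows. The sequence of loss values stands in for a real training loop, and the parameter name `patience` (the number of steps allowed without improvement) is an assumption.

```python
def train_with_early_stopping(losses, patience=3):
    """Return (best_step, best_loss), stopping after `patience` steps without improvement."""
    best_loss = float("inf")
    best_step = None
    steps_without_improvement = 0
    for step, loss in enumerate(losses):
        if loss < best_loss:
            # New best value: memorize it (a real loop would also save the weights).
            best_loss, best_step = loss, step
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break
    return best_step, best_loss


loss_history = [5, 4, 3, 3.5, 3.2, 3.1, 2, 1]
impatient = train_with_early_stopping(loss_history, patience=3)
patient = train_with_early_stopping(loss_history, patience=4)
```

With a patience of 3 the loop stops before the later improvement is reached, which shows that this parameter is itself a trade-off between computation and results.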
Gradient Descent - Unlucky examples
Backpropagation
Gradient Descent is beautiful, but right now, we only know in which direction (the gradient) the output of the model should go. To transmit this information to the weights of the layers of the model, we use backpropagation.
Backpropagation
The process of computing the gradient of the loss with respect to the weights of each layer of the model, and modifying these weights accordingly. The name comes from the process starting with the last layer of the model and propagating incrementally back to the first layer.
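A minimal sketch of backpropagation on a deliberately tiny model: one input, two layers with a single weight each, a ReLU activation in between, and a squared error loss. The architecture, the names and the values are hypothetical; the point is the order of the computations, from the loss back to the first layer.

```python
def forward_backward(x, target, w1, w2):
    """Return the loss and its gradients with respect to w1 and w2."""
    # Forward pass: from the input to the loss.
    h = w1 * x               # first layer
    a = max(0.0, h)          # ReLU activation
    y = w2 * a               # second (last) layer
    loss = (y - target) ** 2

    # Backward pass: chain rule, starting from the last layer.
    d_y = 2 * (y - target)                # d loss / d y
    d_w2 = d_y * a                        # d loss / d w2
    d_a = d_y * w2                        # d loss / d a
    d_h = d_a * (1.0 if h > 0 else 0.0)   # through the ReLU
    d_w1 = d_h * x                        # d loss / d w1
    return loss, d_w1, d_w2


loss, grad_w1, grad_w2 = forward_backward(x=2.0, target=1.0, w1=0.5, w2=3.0)
```

A gradient descent step would then subtract the learning rate times each gradient from the corresponding weight.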
Feedforward Neural Networks (FNNs)
Convolutional Neural Networks (CNNs)
Recurrent Neural Networks (RNNs)
Generative Adversarial Networks (GANs)
Overview
- Data acquisition
- Data preprocessing
- Model selection
- Model evaluation
- Final model training
Data acquisition
Gather the data, potentially from multiple different sources. Choosing the right sources can also depend on the choices made in the next steps.
Different issues
Multiple sources of issues and steps to perform:
1. Handle different formats
2. Remove outliers (mostly for raw data)
3. (Optionally) extract features
4. Handle missing data
5. Normalize
Why normalization?
Idea
A priori all features have the same importance, so none of them should have an advantage. Therefore, having features with larger values than others would be detrimental.
Usually, all features are individually normalized over the whole dataset, to obtain a distribution with an average of 0 and a standard deviation of 1:
\[
\begin{align*}
\hat{X} & = \frac{1}{n+1} \sum\limits_{j=0}^n X_j \\
\sigma_X & = \sqrt{\frac{1}{n+1} \sum\limits_{j=0}^n (X_j - \hat{X})^2} \\
\forall k \in \{0, \cdots, n\}, \quad X_k & \leftarrow \frac{X_k - \hat{X}}{\sigma_X}
\end{align*}
\]
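A minimal sketch of this normalization for a single feature, assuming the population standard deviation (dividing by the number of values) and a feature that is not constant. The function name and the toy values are hypothetical.

```python
import math


def standardize(values):
    """Rescale a feature to a mean of 0 and a standard deviation of 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


normalized = standardize([2, 4, 4, 4, 5, 5, 7, 9])  # mean 5, standard deviation 2
```

In practice the mean and standard deviation are computed on the training data and reused as-is on any new data.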
Model selection
- Type of model (ML, NN, DL, …)
- Complexity:
- Number of features
- Type of output
- Size of the layers (for NN)
- Number of layers (for NN)
- Hyperparameters
Model training
- Loss selection: depends on the task, the objectives, the specific issues to solve
- Training process selection (lots of different tweaks and improvements can be implemented in NN training)
- Hyperparameter tuning, by repeatedly:
  - Selecting one or multiple configurations of hyperparameters
  - Training the model one or multiple times
  - Determining the best hyperparameters
Model evaluation - Criteria
Criteria selection among the many possible ones:
- For classification:
  - Accuracy: for balanced datasets
  - Precision: when false positives are costly
  - Recall: when false negatives are costly
  - F1-Score: when class distribution is unbalanced
  - …
- For regression:
  - Mean Absolute Error (MAE)
  - Mean Squared Error (MSE): more sensitive to large errors than MAE
  - …
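The classification criteria above can be computed from the counts of true positives, false positives and false negatives. This is a minimal sketch for binary labels; the function name and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1-score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 0, 0],
    y_pred=[1, 1, 0, 1, 0, 0, 0, 0],
)
```

On this toy example the accuracy (0.75) is higher than the precision, recall and F1-score (2/3 each), because the correctly predicted negatives, which are the majority class, only count towards accuracy.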
Model evaluation - Cross-validation
Cross-validation
Method to estimate the real performance of the model by:
1. Splitting the dataset into multiple parts (usually 5)
2. For different combinations of these parts (usually 5), training the model on all parts but one and evaluating it on the remaining part
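A minimal sketch of the splitting step, assuming the common k-fold scheme in which each part is used exactly once for evaluation while the others are used for training. Samples are identified by their index and are not shuffled here, which a real pipeline would usually do first; the function name is hypothetical.

```python
def k_fold_splits(n_samples, n_folds=5):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, n_folds)
    splits = []
    start = 0
    for fold in range(n_folds):
        # The first `remainder` folds get one extra sample.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        splits.append((train, test))
        start = stop
    return splits


splits = k_fold_splits(10, n_folds=5)
```

The model is then trained and evaluated once per pair, and the evaluation scores are averaged to estimate its performance.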
Final model training
Once the data is preprocessed, the model selected, and the hyperparameters chosen and optimized, the final model can be trained multiple times to keep the best one.